Hi, everyone. Welcome back to the Heterogeneous Parallel Programming class. We are at Lecture 1.8, Kernel-Based Parallel Programming, and we will be discussing basic matrix-matrix multiplication.

The objective for this lecture is to prepare you for the lab assignment on matrix-matrix multiplication. We will first review the computation for matrix multiplication. We will also discuss how we can do the block and thread index to data index mapping for this particular computation. And then we will introduce loop control flow in kernels.

Matrix multiplication is a fairly frequently used computation in engineering and science applications, and the computation is fairly simple. We have two input matrices, A and B, and we generate a product matrix C. Each C element is generated as the dot product of a row of A and a column of B. So in this particular picture, we show that a C element has the coordinates Row and Col. We use the row of A with index Row and the column of B with index Col, and we take that row and that column and do a dot product to generate the value of the C element.
In all the discussions that we have today, we will assume that the A matrix is of dimension m times n, the B matrix is of dimension n times k, and the C matrix that we generate is of dimension m times k. The n part, which is the number of elements in a row of A and the number of elements in a column of B, must match, because we are doing a dot product, so we must have the same number of elements from A and from B in that dot product. We are going to use these variables consistently in all the examples in this lecture.

So here we show a simple sequential C code for matrix multiplication. The C code has two levels of loops, and these two loops systematically go through all the C elements, in all the row positions and all the column positions. For every C element that it visits, it generates the dot product of the corresponding A row and B column for that C element. So here is the C code for generating each C element.
We first initialize the C element value to 0, and then we systematically visit all the elements in the row of A and the column of B. When we visit the row of A, we first need to reach the beginning location of that row. The beginning location can be found by multiplying the variable Row by the number of elements in each row of A, which is n. So Row times n moves us to the beginning location of the row in the linear, row-major layout of A. Then we do the same thing with B. In B's case, the beginning location of the column is actually in row 0, so we can simply use Col to move the index to the beginning element of that column in B. Now, for every iteration of the i loop, we need to be able to calculate the corresponding element within the row and the column. For A it is very simple: all the elements in the same row are in adjacent locations, so all we have to do is add i to the beginning index, and that gives us the proper element for that iteration. For B it is a little bit more complex, because in a row-major layout all the elements in a column are in strided locations.
So we need to multiply i by the number of elements in each row of B, which is k, and then add that to the beginning location. These two memory accesses give us the appropriate elements of A and B in each iteration, and then we do the multiplication and accumulate the result into sum. Once we finish the for loop, we have the total value for the C element. Now we can write the C element value into the appropriate C location, which is the location at Row and Col. So again, we just calculate the linearized address by multiplying Row by k, and then adding Col to the index. This gives us the complete sequential C code for matrix multiplication. Notice that this particular calculation is a very simple and very basic form. Since matrix multiplication is a very important computation, there are actually lots and lots of optimized ways of doing the sequential computation. We will come back to this point later on.

So here we will begin to design a kernel function in CUDA to perform matrix multiplication in parallel. We are going to use a very small example to give you the intuition behind the design of the kernel function.
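The sequential version walked through above can be sketched as follows. This is a minimal sketch following the lecture's conventions (row-major layout, dimensions m, n, k); the function name is illustrative:

```c
#include <stdio.h>

/* Sequential matrix multiplication: A is m x n, B is n x k, C is m x k.
   All three matrices use a linearized, row-major layout. */
void matrixMul(const float *A, const float *B, float *C, int m, int n, int k) {
    for (int row = 0; row < m; ++row) {
        for (int col = 0; col < k; ++col) {
            float sum = 0.0f;
            for (int i = 0; i < n; ++i) {
                /* A[row][i] lives at row*n + i (adjacent within a row);
                   B[i][col] lives at i*k + col (strided within a column). */
                sum += A[row * n + i] * B[i * k + col];
            }
            C[row * k + col] = sum;  /* C[row][col] at row*k + col */
        }
    }
}
```

Note how the strided access `i * k + col` for B mirrors the lecture's point that column elements are not adjacent in a row-major layout.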
So here we are going to assume that we are using a two-dimensional, square block to organize the threads that calculate C. That is, each thread block is going to have an equal number of threads in both the X and Y dimensions. Obviously, this is a simplifying assumption; you can definitely use a rectangular block organization, but here we use a square organization just to keep the slide simple. And so, assuming that we have a square organization, we will assume that we have a compile-time defined constant, TILE_WIDTH, which gives us the number of threads in each dimension of the thread block.

We have a simple example here where we calculate a 4 by 3 C matrix. So in this case, m equals 4 and k equals 3. We will further assume that n is 4 for both A and B, which is not visible on this slide but will become useful later on. So for this very simple, small example, we have four elements in the Y dimension and three elements in the X dimension of C, and we need to make sure that we generate enough thread blocks to calculate all the C elements.
If we assume that TILE_WIDTH is 2 for this simple example, then we go through the same ceiling-function calculation that we used before in the picture kernel, and we conclude that we need two thread blocks in the Y dimension as well as two thread blocks in the X dimension to cover the calculation of all the C elements. So this gives us the simple example that we are going to be working with. In every thread block, we have two threads in the X dimension and two threads in the Y dimension, so these threads are going to be accessing the C elements located in the tile that the thread block is responsible for.

So now we are ready to discuss the host code for invoking the matrix multiplication kernel. We assume that the matrix dimension variables m, n, and k are available in the host code. We will be using the ceiling-function expressions based on k and m to generate enough thread blocks to cover all the C elements. This is very similar to what we used in the picture kernel.
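The ceiling-function grid sizing described here can be sketched as below. The helper is plain C so it can be checked on the host; the CUDA launch lines in the comment are a sketch under the lecture's conventions, and the kernel name is illustrative:

```c
#define TILE_WIDTH 2  /* threads per block in each dimension (example value) */

/* Ceiling division: the number of tiles needed to cover 'size' elements.
   For the lecture's example (C is 4 x 3, TILE_WIDTH = 2) this yields a
   2 x 2 grid of blocks. */
int ceil_div(int size, int tile) {
    return (size + tile - 1) / tile;
}

/* In the CUDA host code, these expressions would feed the launch
   configuration, roughly:

       dim3 dimGrid(ceil_div(k, TILE_WIDTH), ceil_div(m, TILE_WIDTH), 1);
       dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
       MatrixMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, m, n, k);
*/
```

The Z dimension is set to 1 in both the grid and the block, since it is not used for this two-dimensional computation.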
For each thread block, we declare that we have TILE_WIDTH times TILE_WIDTH threads, that is, TILE_WIDTH threads in each of the X and Y dimensions. Because the Z dimension is not used, we just use 1 for that dimension. So we launch the kernel with the grid dimension and block dimension that we set up in the host code, and then we pass m, n, k and the pointers to A, B, and C. Just another reminder: the A matrix is m times n, the B matrix is n times k, and the C matrix is m times k.

Here we show the matrix multiplication kernel. At the beginning, every thread determines the row and column index of the C element that it is calculating. Then they all proceed to test whether the row index and column index are within the valid range. If these indices are within the valid range, then the thread goes through the for loop to access the elements of the row of A and the elements of the column of B to produce the inner product. So this is pretty much the same as the sequential code. And after the for loop, we have the C element value.
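The per-thread logic described here can be sketched as a plain C function, so the index mapping can be exercised on the host; the parameter names stand in for CUDA's built-in variables, as shown in the comment:

```c
/* One thread's work in the matrix-multiplication kernel, written as host C
   so the mapping can be tested.  In the actual CUDA kernel, row and col come
   from the built-in index variables:

       int row = blockIdx.y * blockDim.y + threadIdx.y;
       int col = blockIdx.x * blockDim.x + threadIdx.x;
*/
void matrixMulThread(int blockIdxY, int blockIdxX,
                     int threadIdxY, int threadIdxX, int blockDim,
                     const float *A, const float *B, float *C,
                     int m, int n, int k) {
    int row = blockIdxY * blockDim + threadIdxY;
    int col = blockIdxX * blockDim + threadIdxX;
    if (row < m && col < k) {          /* boundary test: out-of-range threads do nothing */
        float sum = 0.0f;
        for (int i = 0; i < n; ++i)    /* inner product of A's row and B's column */
            sum += A[row * n + i] * B[i * k + col];
        C[row * k + col] = sum;        /* same linearized write as the sequential code */
    }
}
```

Looping this function over all blocks and threads on the host reproduces exactly what the grid of CUDA threads computes in parallel.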
So we write the C element value into the global memory, and the index calculation is exactly the same as in the sequential C code. This is the reason why we took the time to discuss the row-major layout and the linearization of indices. Keep in mind that in CUDA, A, B, and C are all going to be dynamically allocated with cudaMalloc; that is why we need to do the linearization of these indices when we read from A and B and write into C.

Now we are ready to review a little bit more of the dynamic execution behavior of the kernel. As we mentioned, there will be four thread blocks executing this kernel for our small 4 by 3 C example. Here we show the work that thread block (0,0) is going to be doing. For thread block (0,0), the block indices in the x and y dimensions are both 0. This means that when we calculate the row and column indices, these values are solely determined by the thread index in the x and y dimensions. So here we show that the row and column indices will be 0 and 1 across the four threads. Let's take a look at the work that thread (0,0) is going to do. Thread (0,0) has a row value of 0 and a column value of 0.
So it is going to be accessing the zeroth row of A and the zeroth column of B. We show a horizontal arrow across the zeroth row of A and a vertical arrow down the zeroth column of B to indicate all the memory accesses that thread (0,0) of block (0,0) is going to do. Similarly, thread (1,1) has a row value of 1 and a column value of 1, so it is going to be accessing row 1 of A and column 1 of B. We also show this with a horizontal arrow going through row 1 and a vertical arrow going through column 1 in the picture. So this picture shows the work that all four threads in block (0,0) will be doing when executing the kernel. In fact, you are encouraged to use very small numbers, such as the ones that we provided here, to test your kernel and make sure that you can look at all the numbers and verify their correctness before you start to use your kernel on large matrices.

For a little bit more insight, we will look at the work for block (0,1). In this case, the Y-dimension block index is still 0, but the X-dimension block index is 1.
So in this case the row indices remain the same, because the row indices are calculated based on the y block index, which is 0. However, the x dimension of the block index is now 1, so we have effectively shifted the column values of the threads to 2 and 3, compared to block (0,0). Now we see that all the threads in this thread block are going to be processing the upper-right corner of the C matrix. If we look at thread (0,0) in this block, it is still accessing row 0 of A, but it is accessing column 2 of B. It will generate that dot product and write it into C element (0,2). The more interesting comparison is for thread (1,0) and thread (1,1): those two threads are actually going to be processing column 3 of C, which is outside the valid range. So both of those threads are going to fail the condition test of the if statement, and these two threads will not be doing any work or any memory accesses when they execute the kernel.
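The boundary-test behavior for block (0,1) can be checked with a small host-side sketch of the index calculation (function name and parameter order are illustrative; the example dimensions match the lecture's 4 by 3 C matrix with TILE_WIDTH = 2):

```c
#define TILE_WIDTH 2

/* Returns 1 if the thread passes the kernel's boundary test, 0 if it is
   masked off.  Mirrors: row = blockIdx.y * TILE_WIDTH + threadIdx.y,
   col = blockIdx.x * TILE_WIDTH + threadIdx.x, then if (row < m && col < k). */
int threadInRange(int blockIdxY, int blockIdxX,
                  int threadIdxY, int threadIdxX, int m, int k) {
    int row = blockIdxY * TILE_WIDTH + threadIdxY;
    int col = blockIdxX * TILE_WIDTH + threadIdxX;
    return (row < m && col < k);
}
```

For block (0,1), the two threads with threadIdx.x = 1 compute col = 3, which fails the test for k = 3; the two threads with threadIdx.x = 0 compute col = 2 and proceed.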
So this gives us a good illustration of the runtime behavior, and I would like to encourage you to finish the analysis of this example for block (1,0) and block (1,1), okay?

So, at this point, we have finished the coverage of the simple matrix multiplication kernel in CUDA, and you are ready for the lab. What I would like to reiterate is that when you test your kernel, you should start with some very small matrices, such as the dimensions that we have shown in this example. You can even start with smaller, 2 by 2 matrices, so that you can use the kind of data values that you can calculate by hand and verify that your kernel is doing all the right things, before you apply the kernel to the much bigger matrices of the data that we supply for your testing.

So at this point we are at the end of week one. I would also like to encourage you to take the quiz as soon as possible, to make sure that you solidify your understanding of the week one material. For those of you who would like to learn more, I encourage you to read the textbook, section 4.3. Thank you.